1. INTRODUCTION

String matching algorithms locate the best alignment of one or more patterns within a given text by
comparing each pattern against successive windows of the text. During the comparison stage, the text window
must have the same length as the pattern, and the outcome of the comparison depends on whether the pattern
and the text window match [1]. String matching is used in numerous computer applications, such as signal and
image processing, artificial intelligence (AI), web search engines, intrusion detection systems, operating
systems, speech and pattern recognition [2], [3], information retrieval [4], and computational biology and
chemistry. In recent years, string matching algorithms have also become a main component of DNA pattern
matching and the analysis of protein sequences [5], [6]. Because databases are growing rapidly, there is a
need to enhance the performance of exact string matching algorithms.

Several well-known exact string matching algorithms, such as Brute force, Boyer-Moore, Karp-Rabin, and
Knuth-Morris-Pratt (KMP), are widely used. The Brute force algorithm is the simplest of these. It compares
the pattern with the text window character by character and, whether a match or a mismatch is found, shifts
the window by exactly one character to the right [2], [7]. The Boyer-Moore algorithm is widely used because
of its high performance and efficiency. Its comparison proceeds from right to left, and after a match or a
mismatch between the pattern and the text window the shift is determined by the good-suffix and
bad-character functions [8]. The Karp-Rabin algorithm is based on hashing, and its comparison proceeds from
left to right. It computes the hash values of the pattern and of the text window; because these values are
integers, comparing them is fast, which reduces the time consumed [9], [10]. The KMP algorithm is regarded as
the first linear-time algorithm. Its comparison proceeds from left to right, and the shift is derived from
the part of the pattern that has already been matched [2].
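
To make the hashing idea behind Karp-Rabin concrete, the following C sketch counts the occurrences of a
pattern using a rolling hash. It is only an illustration under stated assumptions: the radix KR_BASE, the
modulus KR_MOD, and the function name kr_search are chosen here for the example and are not taken from
[9], [10].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KR_BASE 256u       /* assumed radix: one digit per byte value */
#define KR_MOD  1000003u   /* assumed prime modulus */

/* Returns the number of occurrences of x[0..m-1] in y[0..n-1]. */
static size_t kr_search(const char *x, size_t m, const char *y, size_t n)
{
    if (m == 0 || n < m) return 0;

    unsigned long high = 1;   /* KR_BASE^(m-1) mod KR_MOD */
    for (size_t i = 1; i < m; i++) high = (high * KR_BASE) % KR_MOD;

    /* Hash the pattern and the first text window. */
    unsigned long hx = 0, hy = 0;
    for (size_t i = 0; i < m; i++) {
        hx = (hx * KR_BASE + (unsigned char)x[i]) % KR_MOD;
        hy = (hy * KR_BASE + (unsigned char)y[i]) % KR_MOD;
    }

    size_t count = 0;
    for (size_t j = 0; ; j++) {
        /* Characters are compared only when the integer hashes agree. */
        if (hx == hy && memcmp(x, y + j, m) == 0) count++;
        if (j + m >= n) break;
        /* Roll the window one position right: drop y[j], append y[j+m]. */
        unsigned long drop = ((unsigned char)y[j] * high) % KR_MOD;
        hy = (hy + KR_MOD - drop) % KR_MOD;
        hy = (hy * KR_BASE + (unsigned char)y[j + m]) % KR_MOD;
    }
    return count;
}
```

For example, kr_search("ab", 2, "abcabab", 7) returns 3, since the pattern occurs at positions 0, 3, and 5.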

The performance of exact string matching algorithms is governed by three factors: the number of attempts,
the number of character comparisons, and the time consumed. An efficient algorithm reduces one or more of
these. Most string matching algorithms reduce the number of attempts and the number of character comparisons
when run sequentially, but they remain weak in terms of the time consumed. Researchers have therefore
concentrated on the time consumed and have turned to parallel processing, which reduces the execution time
of exact string matching algorithms [11].

Parallel processing is a practical way to reduce computational time: it uses the cores or processors of a
computer to work on a problem that would otherwise be computed sequentially [5], [11], [12]. As computer
technology advances, particularly in multicore and multiprocessor architectures, there is a pressing need to
enhance the performance of exact string matching algorithms so that they keep pace with these systems and
reduce the time consumed on multicore and multiprocessor computers.

Parallel computing has great potential to improve execution time in comparison with sequential computing,
which takes longer to obtain results [11]. Numerous exact string matching algorithms have been parallelized
using either multicore or multiprocessor techniques. One example is the AKRAM algorithm, which employed a
parallel multiprocessor model based on the message passing interface (MPI). Its parallelization used data
decomposition: the data were split into numerous subparts that were distributed to the processors inside the
cluster. The MPI multiprocessor model showed higher performance than the sequential version of the AKRAM
algorithm [13]. The Karp-Rabin algorithm has also been parallelized on multiprocessors: it divides the
string into subsets, each of which is compared with the pattern separately. This algorithm obtained good
results on large datasets but was inefficient with short patterns. Multicore technology has been applied as
well, for example to the Quick search algorithm, which used the OpenMP paradigm to reduce its execution
time. OpenMP operates through data decomposition, dividing the data into subsets through the fork-and-join
process, and it demonstrated good parallel time in comparison with the sequential time [11].

The KMP algorithm has been parallelized with the hybrid OpenMP/MPI model, and it parallelized well for
large string sizes. Because of the communication time, however, it performed poorly when more than two
clusters were used, whereas it obtained very good results with only two clusters [14]. To enhance the
filtering of intrusion detection systems (IDS), the Quick search algorithm was parallelized with two
multicore paradigms, Pthreads (POSIX threads) and OpenMP, in order to increase the speedup of the parallel
algorithm and obtain a faster IDS [15]. Graphics processing unit (GPU) technology has also been applied to
string matching algorithms such as Karp-Rabin. Across different numbers of cores and threads and different
pattern and string sizes, the GPU implementation achieved a speedup of up to 23x over the CPU
implementation [16].

In this paper, we redesign the exact string matching algorithm called E-Atheer using a parallel model, with
the aim of reducing its execution time and increasing its speedup. We evaluate the performance of the
algorithm over several factors: the type of database, the pattern length, the number of threads, and the
number of cores. Section 2 describes the algorithm and its implementation, covering the technique of the
E-Atheer algorithm, the Pthreads (POSIX) technique, the generation of Pthreads code for the E-Atheer
algorithm, and the implementation and environment. The results are given in section 3, the discussion and
analysis are presented in section 4, and the conclusion is given in section 5.
2. ALGORITHM AND IMPLEMENTATION
2.1. E-Atheer algorithm 

The E-Atheer algorithm has two stages: a preprocessing stage and a searching stage. The preprocessing stage
relies on functions selected from two algorithms (Atheer and Berry-Ravindran): the hashing function, the
Berry-Ravindran bad character (brBc) function, and the Boyer-Moore bad character (bmBc) function. The
searching stage first compares the hash values of three characters of the pattern and of the text window. If
the hash values match, the three characters themselves are compared. If these also match, the hash values of
the remaining characters from the second character to position (m/2-1) are compared and, on a match, the
characters themselves are compared. On a further match, the hash values of the characters from position
(m/2+1) to the (last character - 1) are compared. Finally, if these still match, the corresponding
characters of the pattern and the text window are compared. After a match or a mismatch at any step, the new
shift is the largest of the value for m from the bmBc table and the values for (m+1 and m+2) from the brBc
table [17].
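
As an illustration of the kind of table the preprocessing stage builds, the following C sketch computes the
standard Boyer-Moore bad-character table as it is commonly presented in the string matching literature. It
is not the E-Atheer preprocessing itself, which also involves the hashing and brBc functions; the alphabet
size ASIZE and the function name pre_bmBc are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define ASIZE 256   /* assumed alphabet size: one entry per byte value */

/* Standard Boyer-Moore bad-character table for the pattern x[0..m-1].
   bmBc[c] is the distance from the rightmost occurrence of c in
   x[0..m-2] to the end of the pattern, or m if c does not occur there. */
static void pre_bmBc(const unsigned char *x, size_t m, size_t bmBc[ASIZE])
{
    for (size_t c = 0; c < ASIZE; c++) bmBc[c] = m;
    for (size_t i = 0; i + 1 < m; i++) bmBc[x[i]] = m - 1 - i;
}
```

For the pattern GCAGAGAG (m = 8), this table holds 1 for A, 6 for C, 2 for G, and 8 for every character
that does not occur in the pattern, such as T.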


2.2. POSIX threads 

A thread is a separate flow of control within the address space of a process [18]. Threads are used for
parallel execution on shared-memory multiprocessor and multicore architectures. POSIX threads (Pthreads) is
a low-level programming interface for working with OS threads [19]. Pthreads refers to the C-language thread
interface for UNIX defined by the IEEE POSIX 1003.1c standard, and it is used to create multiple threads
within the calling process. Another means of parallelization in UNIX is fork, which creates a new process as
a copy of the caller. Experimental results show that Pthreads obtain better results than fork, because a
thread can be created with less operating-system overhead than a forked process, which requires a separate
execution context [18].
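
The following C sketch shows the basic Pthreads pattern described above: several threads are created inside
the calling process and then joined. Because the threads share the caller's address space, each worker
writes its result directly into an array owned by the caller, which a forked process could not do. The names
square_worker, run_workers, and NUM_THREADS are assumptions made for this example.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4   /* assumed thread count for illustration */

/* Thread entry point: squares the integer it is given, in place. */
static void *square_worker(void *arg)
{
    int *value = arg;
    *value = (*value) * (*value);
    return NULL;
}

/* Creates NUM_THREADS threads in the calling process, waits for all of
   them, and returns the sum of their results (or -1 on error). */
static int run_workers(void)
{
    pthread_t tid[NUM_THREADS];
    int data[NUM_THREADS];
    int sum = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        data[i] = i + 1;   /* input for thread i, stored in the caller's memory */
        if (pthread_create(&tid[i], NULL, square_worker, &data[i]) != 0)
            return -1;
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_join(tid[i], NULL) != 0) return -1;
        sum += data[i];    /* safe to read once thread i has been joined */
    }
    return sum;
}
```

With four threads squaring the values 1 to 4, run_workers() returns 30.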


2.3. Generating Pthreads code for the E-Atheer hybrid algorithm

This section describes the main technical contribution, a parallel C program built on the Pthreads library.
The parallel operations of the Pthread paradigm employed in this study comprise the following steps:

a) The first step provides fine-grained control through thread management, which operates directly on the
threads. A new thread function is defined with THREAD (name) and then started with the StartThread (name);
function. The iMaxThreads function defines the maximum number of threads, and the threads are numbered from
0 to P-1, where P is the number of possible threads.

The first step also uses the mutex functions, which require initialization before use. A mutex can be
initialized statically by assigning it the predefined value PTHREAD_MUTEX_INITIALIZER. A thread calls the
pthread_mutex_lock () routine to obtain the lock on the mutex variable, and calls the pthread_mutex_unlock ()
routine once it has finished with the protected data, so that other threads that need the mutex for their
own data can obtain it. Other thread-management functions are also used: WaitForThread (int iThread) waits
until the thread numbered iThread finishes its execution, and WaitForAll () waits for all the threads.

b) In the second step, the variables are placed in the shared section so that they are shared by all the
threads. The arguments used by the algorithm are the text length n, the text y, the pattern length m, and
the pattern x. The bmBc and brBc functions, which determine the shift in the E-Atheer algorithm, are also
handled in this step.

c) The third step decomposes the data: the text array y[ ] is divided into p small chunks, and each chunk
is processed by a single thread in the parallel region. A chunk cannot be exactly n/p characters long,
because an occurrence of the pattern may straddle the boundary between two chunks. Each chunk is therefore
n/p+m-1 characters long.

d) In the fourth step, after the text has been divided, the thread on each core takes one chunk, and every
thread receives the same pattern. Each thread searches for the pattern in its own chunk independently,
following the technique of the algorithm. In the E-Atheer algorithm, each thread applies the hash function
three times, once for each of the three phases of the pattern, alongside the text window in the searching
phase. When a thread finishes, it has recorded the number of character comparisons and the number of
attempts made between its chunk and the pattern, together with the time consumed.
e) In the fifth step, a mutex-protected reduction combines the results from all the threads. Each thread
calls pthread_mutex_lock (Reduction) when it ends, in order to add its own summarized results to the shared
totals. Through this reduction, the parallel algorithm computes the number of character comparisons, the
number of attempts, and the time consumed for each thread, followed by the final results for all the
threads. The fifth step also uses pthread_mutex_unlock (Reduction); which is executed after the thread has
completed its update, so that the remaining threads can proceed and new threads can be created again, as
illustrated in Figure 1.
Figure 1. Flowchart for the parallel E-Atheer algorithm using the Pthread paradigm
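
Steps c) to e) can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the
authors' program: a brute-force comparison stands in for the E-Atheer search, only the match count is
reduced (rather than comparisons, attempts, and time), error handling is omitted, and the names chunk_t,
search_chunk, parallel_count, reduction, and MAX_THREADS are chosen for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_THREADS 64   /* assumed upper bound for this sketch */

typedef struct {
    const char *y; size_t start, len;   /* this thread's chunk of the text */
    const char *x; size_t m;            /* pattern shared by all threads */
} chunk_t;

static pthread_mutex_t reduction = PTHREAD_MUTEX_INITIALIZER;
static size_t total_matches;            /* shared total, guarded by reduction */

/* Stand-in search (brute force, NOT E-Atheer): counts matches in one chunk. */
static void *search_chunk(void *arg)
{
    const chunk_t *c = arg;
    size_t local = 0;
    for (size_t j = 0; j + c->m <= c->len; j++)
        if (memcmp(c->x, c->y + c->start + j, c->m) == 0) local++;

    pthread_mutex_lock(&reduction);     /* step e: merge the per-thread result */
    total_matches += local;
    pthread_mutex_unlock(&reduction);
    return NULL;
}

/* Splits y[0..n-1] into p chunks of n/p + m - 1 characters (the last chunk
   takes whatever remains) and searches them in parallel. */
static size_t parallel_count(const char *x, size_t m,
                             const char *y, size_t n, size_t p)
{
    pthread_t tid[MAX_THREADS];
    chunk_t chunk[MAX_THREADS];
    if (m == 0 || n < m || p == 0 || p > MAX_THREADS) return 0;

    size_t base = n / p;
    total_matches = 0;
    for (size_t i = 0; i < p; i++) {
        size_t start = i * base;
        /* The m - 1 extra characters let a match straddle a chunk boundary. */
        size_t len = (i + 1 == p) ? n - start : base + m - 1;
        if (len > n - start) len = n - start;   /* clip at the end of the text */
        chunk[i] = (chunk_t){ y, start, len, x, m };
        pthread_create(&tid[i], NULL, search_chunk, &chunk[i]);
    }
    for (size_t i = 0; i < p; i++) pthread_join(tid[i], NULL);
    return total_matches;
}
```

Because each chunk overlaps the next by m-1 characters, every window of the text is examined by exactly one
thread: parallel_count("aba", 3, "abababab", 8, 4) returns 3, the same count as with a single thread.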
2.4. Implementation and environment
2.4.1. Hardware and Khawarizmi cluster architecture 

The experiments were run on the Khawarizmi cluster in the parallel lab of the School of Computer Science,
Universiti Sains Malaysia (USM) (khawarizmi.cs.usm.my). The cluster has three nodes: a single master node
(2xQuad-Core Intel Xeon E5450 3.00 GHz, 2x12 MB Cache, 1333 MHz FSB) and two slave nodes (2xQuad-Core Intel
Xeon 1.6 GHz, 2x4 MB Cache, 1066 MHz FSB). Each node has two processors with four cores each, and each core
runs a single thread. The operating system is Linux (Rocks Cluster Distribution 6.1, CentOS 6.3, 64-bit),
and the compiler is GCC 4.4.6.


2.4.2. Performance metrics 

The proposed algorithm is parallelized with Pthreads. Its results are evaluated with metrics that compare
the performance of the sequential and parallel versions of the algorithm, namely the execution time and the
speedup [20]-[22].
A. Execution time 

The sequential time is the time that elapses between the start and the end of execution on a single
processor, covering all operations. The parallel time is the time that elapses from the moment the first
processor starts until the moment the last processor finishes. The sequential time is denoted Ts and the
parallel time is denoted Tp.
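
One way to measure such elapsed times in milliseconds on a POSIX system is sketched below in C. It is an
assumption about how the timing could be done, not a description of the authors' code; the helper names
now_monotonic and elapsed_ms are chosen for this example. A wall-clock source is used because the parallel
time is defined from the first start to the last finish.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Reads the monotonic wall clock. */
static struct timespec now_monotonic(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t;
}

/* Milliseconds elapsed between two clock readings. */
static double elapsed_ms(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0
         + (end.tv_nsec - start.tv_nsec) / 1.0e6;
}
```

To obtain Tp with these helpers, one reading would be taken before the first thread is created and another
after the last thread has been joined; Ts would be taken around the sequential search in the same way.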
B. Speedup 

Speedup measures the benefit gained from parallel processing. It is the ratio of the time elapsed in the
sequential run to the time elapsed in the parallel run. Here Ts is the sequential time, Tp is the parallel
time, and S is the speedup. Times are measured in milliseconds, and the speedup is given by the following
equation.


Speedup (S) = Ts /Tp 
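
As a worked instance of this ratio, the average DNA short-pattern times reported in Table 1 (Ts = 699 ms
sequentially, Tp = 358 ms on two cores) give the speedup reported for two cores on the DNA dataset. The
tabulated values are rounded averages, so recomputing other entries in the same way may not reproduce them
exactly.

```latex
S = \frac{T_s}{T_p} = \frac{699\ \text{ms}}{358\ \text{ms}} \approx 1.95
```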


2.4.3. Experiment design 

The databases used in this experiment were downloaded from the Pizza and Chili Corpus website
(http://pizzachili.dcc.uchile.cl/). The datasets used are DNA, Protein, XML, and Pitch, each 200 MB in
size. The program was run five times on each dataset and the average was calculated. The experiment was run
on a machine that has been used in numerous prior studies. Eight cores were used because the cluster has
three nodes with 8 cores each and a single thread per core; using more than 8 cores would be
counterproductive. In the figures, the number of cores is a power of two, from 2^1 to 2^3. In the results,
(Seq) denotes the sequential time, and (C2), (C4), and (C8) denote two, four, and eight cores respectively.
The sequential results were compared with the parallel results for each number of cores. Average results
were used to make the comparison across databases and algorithms, and the presentation of the parallel
results, easier. Two ranges of pattern length were used: short patterns, from 4 to 28 characters, and long
patterns (lengths that are powers of 2), from 2^5 to 2^10 characters. In the figures, each pattern length
is drawn in a different colour for the parallel times and the speedup.


3. RESULTS 

In the parallel version of the E-Atheer algorithm, the dataset is decomposed into several segments that are
distributed to the cores using Pthreads (POSIX). The algorithm is evaluated by comparing the speedup, as
well as the parallel and sequential times, for short and long pattern lengths on 200 MB of data. The
databases used in the experiment differ in alphabet size, which makes it possible to analyze the behaviour
of the algorithm for different alphabet sizes.


3.1. Parallel and sequential times 

For both short and long pattern lengths on the 200 MB databases, the parallel time was better than the
sequential time. The Pitch database gave the best time for most of the short and long pattern lengths,
whereas the DNA database gave the worst time for all of the short and long pattern lengths, as shown in
Figures 2 and 3 respectively.
Figure 3. Evaluations of the parallel time utilizing long pattern lengths


3.2. Speedup 

The speedup was high for short pattern lengths. For long pattern lengths, it was high for the DNA database
at all lengths, whereas the other databases obtained good speedup only with two (2) cores. With 4 and 8
cores, the speedup was good for a pattern length of 32 and decreased for the other long pattern lengths.
The DNA database gave the best speedup for all short and long pattern lengths. The XML database was the
worst for most of the short pattern lengths, and the Protein database was the worst for most of the long
pattern lengths, as shown in Figures 4 and 5.
Figure 5. Evaluations of the speedup utilizing long pattern lengths


4. DISCUSSIONS AND ANALYSIS 

The parallel execution time was better than the sequential time. However, as the number of cores increased,
the overhead also increased, owing to the growth in communication time, and this affected the parallel
execution time. When the number of cores and the pattern length increase, the execution times decrease for
both short and long pattern lengths [13], [23].

The best sequential results were 359 ms and 95 ms for short and long patterns respectively, with 200 MB of
data. The worst sequential results were 699 ms and 483 ms for short and long patterns respectively. The
best parallel results for short patterns were 185 ms with 2 cores, 96 ms with 4 cores, and 59 ms with
8 cores; for long patterns they were 56 ms with 2 cores, 41 ms with 4 cores, and 35 ms with 8 cores. The
worst parallel results for short patterns were 358 ms with 2 cores, 189 ms with 4 cores, and 120 ms with
8 cores; for long patterns they were 251 ms with 2 cores, 130 ms with 4 cores, and 71 ms with 8 cores, as
shown in Table 1. The Pitch dataset gave the best sequential and parallel times for short and long pattern
lengths because the E-Atheer algorithm uses the efficient hash and bmBc functions. The DNA dataset gave the
worst sequential and parallel times owing to the small size of the DNA alphabet: the hash function has to
check many repeated characters, so more time is consumed than for the other datasets.


Table 1. Performance assessment for the average sequential and parallel execution times (ms) of the 
E-Atheer algorithm 


Database types   Short pattern                 Long pattern
                 Seq    C2     C4     C8       Seq    C2     C4     C8
DNA              699    358    182    96       483    251    130    71
Protein          372    192    100    59       95     56     41     36
XML              573    316    189    120      104    60     42     38
Pitch            359    185    96     59       100    58     41     35


The speedup was high for short pattern lengths, whereas for long pattern lengths it was good only when
2 cores were used, for all datasets except the DNA database. The best speedups for short patterns with
200 MB of data were 1.95 with 2 cores, 3.82 with 4 cores, and 7.19 with 8 cores; for long patterns they
were 1.93 with 2 cores, 3.72 with 4 cores, and 6.76 with 8 cores. The worst speedups for short patterns
were 1.85 with 2 cores, 3.28 with 4 cores, and 5.16 with 8 cores; for long patterns they were 1.66 with
2 cores, 2.27 with 4 cores, and 2.56 with 8 cores, as shown in Table 2.


Table 2. Performance assessment for the average speedup of the parallel E-Atheer algorithm


Database types   Short pattern           Long pattern
                 C2     C4     C8        C2     C4     C8
DNA              1.95   3.82   7.19      1.93   3.72   6.76
Protein          1.93   3.63   5.98      1.67   2.27   2.56
XML              1.85   3.28   5.16      1.71   2.42   2.74
Pitch            1.93   3.66   5.9       1.66   2.29   2.69


The DNA dataset performed well with every number of cores: because the DNA alphabet has only four
characters, its parallel time is short in comparison with the sequential time for the same data type.
Moreover, the parallel time decreased as the pattern length increased [23]. Furthermore, the algorithm
relies on a hash function that, in the first step, computes the hash value of only three characters.
Therefore, when more than 2 cores are used with a large alphabet and a long pattern, the elapsed time is
reduced and the shifts are small, unlike in the DNA dataset.

The best speedup was obtained with the DNA database for both short and long pattern lengths, because its
parallel time improved the most relative to its sequential time. The speedup increases as the sequential
execution time becomes larger relative to the parallel time. The XML dataset gave the poorest speedup for
short pattern lengths, while the Protein database gave the poorest speedup for long pattern lengths. The
speedup decreased as the alphabet size increased, since the speedup is affected by the database type
[24], [25].


5. CONCLUSION 

The results of this study are reported as the parallel execution time and the speedup of the parallel
E-Atheer algorithm relative to the sequential version, using different types of datasets of size 200 MB and
short and long pattern lengths. The parallel E-Atheer algorithm, implemented with the Pthread paradigm as a
multicore processing technology, outperformed the sequential version: the parallelization reduced the
execution time of the algorithm and achieved high speedup. In parallel execution, the Pitch database gave
the best execution time for both short and long pattern lengths, while the DNA database gave the best
speedup. In future work, the E-Atheer algorithm may be extended to other multicore environments (e.g., GPU
programming) and multiprocessor models (e.g., MPI), as well as to hybrid parallel models such as OpenMP-MPI
or GPU-MPI.